
feat(workflow): finish agent workflow CLI and RQ1 experiment #39

Closed
cm2435 wants to merge 4 commits into main from feature/finish-agent-workflow-cli

Conversation

@cm2435
Contributor

@cm2435 cm2435 commented Apr 26, 2026

Summary

  • Finish the workflow CLI task/resource surfaces for agents, including async-safe workflow tool execution and manager-capable graph edits.
  • Run and document the overnight ResearchRubrics RQ1 CLI-specialism hillclimb, with best artifact tests/real_llm/.rollouts/20260426T234424Z-7fc055f5-03c3-4cab-8117-04e844696482/.
  • Fix real-LLM harness/runtime blockers uncovered during the experiment: vanilla ResearchRubrics prompt loading, provider env propagation, context import cycles, failed-run artifact dumping, and dynamic task/ready scheduling for added subtasks.

Experiment Result

  • Best run: Variant 1f, run ID 7fc055f5-03c3-4cab-8117-04e844696482.
  • Evidence: 5/5 root tasks completed, 5/5 evaluations landed, 5 roots created exactly 15 specialist children, and no recursive child creation after the current-task-level prompt fix.
  • Caveat: run-level status is still failed because advisory specialist child failures propagate; this is documented as the next design choice.
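The caveat above can be illustrated with a minimal sketch of two run-status roll-up rules. This is not the code in this branch: `TaskResult`, the `advisory` flag, and both function names are hypothetical, and the second rule is only one possible answer to the open design choice.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record; the real task model in the repo may differ.
    task_id: str
    status: Status
    advisory: bool = False  # assumed flag marking specialist children as advisory


def run_status_strict(tasks: list[TaskResult]) -> Status:
    """Behaviour described in the caveat: any failed task fails the run."""
    if any(t.status is Status.FAILED for t in tasks):
        return Status.FAILED
    return Status.COMPLETED


def run_status_advisory_aware(tasks: list[TaskResult]) -> Status:
    """Possible alternative: only non-advisory failures fail the run."""
    if any(t.status is Status.FAILED and not t.advisory for t in tasks):
        return Status.FAILED
    return Status.COMPLETED


# One completed root with one failed advisory child.
tasks = [
    TaskResult("root-1", Status.COMPLETED),
    TaskResult("child-1a", Status.FAILED, advisory=True),
]
strict = run_status_strict(tasks)
lenient = run_status_advisory_aware(tasks)
```

Under the strict rule the example run is reported as failed even though its root completed; under the advisory-aware rule it is reported as completed.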

Test plan

  • uv run pytest tests/unit/runtime/test_workflow_service.py tests/unit/cli/test_workflow_cli.py tests/unit/state/test_workflow_cli_tool.py tests/unit/state/test_research_rubrics_workers.py tests/unit/state/test_research_rubrics_benchmark.py tests/unit/runtime/test_import_boundaries.py -q
  • Real-LLM artifact runs documented in docs/experiments/rq1-cli-specialism/changelog.md
  • Backend harness check for best run: GET /api/test/read/run/7fc055f5-03c3-4cab-8117-04e844696482/state returned 20 graph nodes, 70 mutations, 5 evaluations, and 5 resources
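The counts reported for the best run can be cross-checked against each other: 5 roots plus 15 specialist children should give 20 graph nodes, and 5/5 landed evaluations should give 5 evaluation records. The sketch below does only that arithmetic; the dictionary keys and the `check_run_counts` helper are hypothetical and do not reflect the actual shape of the `/state` response.

```python
# Figures taken from this PR description.
ROOT_TASKS = 5
SPECIALIST_CHILDREN = 15
REPORTED = {"graph_nodes": 20, "mutations": 70, "evaluations": 5, "resources": 5}


def check_run_counts(counts: dict[str, int]) -> list[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems: list[str] = []
    expected_nodes = ROOT_TASKS + SPECIALIST_CHILDREN
    if counts["graph_nodes"] != expected_nodes:
        problems.append(
            f"graph_nodes={counts['graph_nodes']}, expected {expected_nodes}"
        )
    if counts["evaluations"] != ROOT_TASKS:
        problems.append(
            f"evaluations={counts['evaluations']}, expected {ROOT_TASKS}"
        )
    return problems


problems = check_run_counts(REPORTED)
```

The mutation and resource counts are not checked because the description gives no independent figure to compare them against.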

@github-actions

E2E smoke — swebench-verified

Screenshots pushed to screenshots/pr-39.

swebench-verified/9310f7f5-7ee3-4780-a493-efa073846768-activity-stack.png
swebench-verified/9310f7f5-7ee3-4780-a493-efa073846768-sad.png
swebench-verified/9310f7f5-7ee3-4780-a493-efa073846768-visual-debugger-full.png
swebench-verified/cohort-ci-smoke-swebench-verified-20260426T222152.png

@github-actions

E2E smoke — minif2f

Screenshots pushed to screenshots/pr-39.

minif2f/0958f7c8-d6ab-4be3-8b5f-5e61d622e476-activity-stack.png
minif2f/0958f7c8-d6ab-4be3-8b5f-5e61d622e476-sad.png
minif2f/0958f7c8-d6ab-4be3-8b5f-5e61d622e476-visual-debugger-full.png
minif2f/cohort-ci-smoke-minif2f-20260426T222155.png

@github-actions

E2E smoke — researchrubrics

Screenshots pushed to screenshots/pr-39.

researchrubrics/cohort-ci-smoke-researchrubrics-20260426T222201.png
researchrubrics/ea4ec05f-15e9-469c-b311-5f4123f57178-activity-stack.png
researchrubrics/ea4ec05f-15e9-469c-b311-5f4123f57178-sad.png
researchrubrics/ea4ec05f-15e9-469c-b311-5f4123f57178-visual-debugger-full.png

@github-actions

E2E smoke — swebench-verified

Screenshots pushed to screenshots/pr-39.

swebench-verified/bacb37b3-13bc-4761-a460-2f262818f5d7-activity-stack.png
swebench-verified/bacb37b3-13bc-4761-a460-2f262818f5d7-sad.png
swebench-verified/bacb37b3-13bc-4761-a460-2f262818f5d7-visual-debugger-full.png
swebench-verified/cohort-ci-smoke-swebench-verified-20260426T223940.png

@github-actions

E2E smoke — minif2f

Screenshots pushed to screenshots/pr-39.

minif2f/cohort-ci-smoke-minif2f-20260426T224003.png
minif2f/d3dbb5d8-d486-4ca5-b5d8-0e1469c3b277-activity-stack.png
minif2f/d3dbb5d8-d486-4ca5-b5d8-0e1469c3b277-sad.png
minif2f/d3dbb5d8-d486-4ca5-b5d8-0e1469c3b277-visual-debugger-full.png

@github-actions

E2E smoke — researchrubrics

Screenshots pushed to screenshots/pr-39.

researchrubrics/473cd440-45b6-45ae-a613-eec2dd2dacf6-activity-stack.png
researchrubrics/473cd440-45b6-45ae-a613-eec2dd2dacf6-sad.png
researchrubrics/473cd440-45b6-45ae-a613-eec2dd2dacf6-visual-debugger-full.png
researchrubrics/cohort-ci-smoke-researchrubrics-20260426T224005.png

Document the overnight RQ1 hillclimb and fix the runtime paths needed for workflow-CLI agents to spawn and schedule specialist subtasks during real-LLM rollouts.

Made-with: Cursor
@cm2435 cm2435 changed the title from "feat(workflow): finish agent workflow CLI task editing" to "feat(workflow): finish agent workflow CLI and RQ1 experiment" Apr 27, 2026
@github-actions

E2E smoke — researchrubrics

Screenshots pushed to screenshots/pr-39.

researchrubrics/b5c0e5df-7d5c-454c-9d14-5f69a73a1ee0-activity-stack.png
researchrubrics/b5c0e5df-7d5c-454c-9d14-5f69a73a1ee0-sad.png
researchrubrics/b5c0e5df-7d5c-454c-9d14-5f69a73a1ee0-visual-debugger-full.png
researchrubrics/cohort-ci-smoke-researchrubrics-20260427T083731.png

@github-actions

E2E smoke — minif2f

Screenshots pushed to screenshots/pr-39.

minif2f/25e84063-88dc-4137-8e4a-17c0f2877708-activity-stack.png
minif2f/25e84063-88dc-4137-8e4a-17c0f2877708-sad.png
minif2f/25e84063-88dc-4137-8e4a-17c0f2877708-visual-debugger-full.png
minif2f/cohort-ci-smoke-minif2f-20260427T083736.png

Centralize model resolution in core so provider-prefixed cloud targets use OpenRouter consistently, while preserving typed logprob payloads without the previous API import cycle.

Made-with: Cursor
@github-actions

E2E smoke — researchrubrics

Screenshots pushed to screenshots/pr-39.

researchrubrics/786b2153-46f5-44af-aaa2-787254eb22d8-activity-stack.png
researchrubrics/786b2153-46f5-44af-aaa2-787254eb22d8-sad.png
researchrubrics/786b2153-46f5-44af-aaa2-787254eb22d8-visual-debugger-full.png
researchrubrics/cohort-ci-smoke-researchrubrics-20260427T093801.png

@github-actions

E2E smoke — minif2f

Screenshots pushed to screenshots/pr-39.

minif2f/8fb4b9a4-82a6-4277-adb0-40c6e6e27583-activity-stack.png
minif2f/8fb4b9a4-82a6-4277-adb0-40c6e6e27583-sad.png
minif2f/8fb4b9a4-82a6-4277-adb0-40c6e6e27583-visual-debugger-full.png
minif2f/cohort-ci-smoke-minif2f-20260427T093802.png

@github-actions

E2E smoke — swebench-verified

Screenshots pushed to screenshots/pr-39.

swebench-verified/938de84d-08e6-4e28-9f3d-09754e57d097-activity-stack.png
swebench-verified/938de84d-08e6-4e28-9f3d-09754e57d097-sad.png
swebench-verified/938de84d-08e6-4e28-9f3d-09754e57d097-visual-debugger-full.png
swebench-verified/cohort-ci-smoke-swebench-verified-20260427T093806.png

@cm2435 cm2435 closed this Apr 27, 2026
